continuous 0
- North America > United States (0.04)
- Europe > Finland (0.04)
- Asia > Singapore (0.04)
- Asia > Middle East > Jordan (0.04)
- Research Report > Experimental Study (0.93)
- Research Report > New Finding (0.92)
Stick-Breaking Embedded Topic Model with Continuous Optimal Transport for Online Analysis of Document Streams
Granese, Federica, Villata, Serena, Bouveyron, Charles
Online topic models are unsupervised algorithms to identify latent topics in data streams that continuously evolve over time. Although these methods naturally align with real-world scenarios, they have received considerably less attention from the community compared to their offline counterparts, due to specific additional challenges. To tackle these issues, we present SB-SETM, an innovative model extending the Embedded Topic Model (ETM) to process data streams by merging models formed on successive partial document batches. To this end, SB-SETM (i) leverages a truncated stick-breaking construction for the topic-per-document distribution, enabling the model to automatically infer from the data the appropriate number of active topics at each timestep; and (ii) introduces a merging strategy for topic embed-dings based on a continuous formulation of optimal transport adapted to the high dimensionality of the latent topic space. Numerical experiments show SB-SETM outperforming baselines on simulated scenarios. We extensively test it on a real-world corpus of news articles covering the Russian-Ukrainian war throughout 2022-2023.
- Government > Military (1.00)
- Law Enforcement & Public Safety (0.93)
- Law (0.93)
- Information Technology (0.67)
- North America > United States (0.04)
- Europe > Finland (0.04)
- Asia > Singapore (0.04)
- Asia > Middle East > Jordan (0.04)
- Research Report > Experimental Study (0.93)
- Research Report > New Finding (0.92)
dnamite: A Python Package for Neural Additive Models
Van Ness, Mike, Udell, Madeleine
Additive models offer accurate and interpretable predictions for tabular data, a critical tool for statistical modeling. Recent advances in Neural Additive Models (NAMs) allow these models to handle complex machine learning tasks, including feature selection and survival analysis, on large-scale data. This paper introduces dnamite, a Python package that implements NAMs for these advanced applications. dnamite provides a scikit-learn style interface to train regression, classification, and survival analysis NAMs, with built-in support for feature selection. We describe the methodology underlying dnamite, its design principles, and its implementation. Through an application to the MIMIC III clinical dataset, we demonstrate the utility of dnamite in a real-world setting where feature selection and survival analysis are both important. The package is publicly available via pip and documented at dnamite.readthedocs.io.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > Middle East > Israel (0.04)
- Health & Medicine > Therapeutic Area (0.68)
- Health & Medicine > Public Health (0.46)
Generating Code World Models with Large Language Models Guided by Monte Carlo Tree Search
Dainese, Nicola, Merler, Matteo, Alakuijala, Minttu, Marttinen, Pekka
In this work we consider Code World Models, world models generated by a Large Language Model (LLM) in the form of Python code for model-based Reinforcement Learning (RL). Calling code instead of LLMs for planning has the advantages of being precise, reliable, interpretable, and extremely efficient. However, writing appropriate Code World Models requires the ability to understand complex instructions, to generate exact code with non-trivial logic and to self-debug a long program with feedback from unit tests and environment trajectories. To address these challenges, we propose Generate, Improve and Fix with Monte Carlo Tree Search (GIF-MCTS), a new code generation strategy for LLMs. To test our approach, we introduce the Code World Models Benchmark (CWMB), a suite of program synthesis and planning tasks comprised of 18 diverse RL environments paired with corresponding textual descriptions and curated trajectories. GIF-MCTS surpasses all baselines on the CWMB and two other benchmarks, and we show that the Code World Models synthesized with it can be successfully used for planning, resulting in model-based RL agents with greatly improved sample efficiency and inference speed.
Residue-Based Natural Language Adversarial Attack Detection
Deep learning based systems are susceptible to adversarial attacks, where a small, imperceptible change at the input alters the model prediction. However, to date the majority of the approaches to detect these attacks have been designed for image processing systems. Many popular image adversarial detection approaches are able to identify adversarial examples from embedding feature spaces, whilst in the NLP domain existing state of the art detection approaches solely focus on input text features, without consideration of model embedding spaces. This work examines what differences result when porting these image designed strategies to Natural Language Processing (NLP) tasks - these detectors are found to not port over well. This is expected as NLP systems have a very different form of input: discrete and sequential in nature, rather than the continuous and fixed size inputs for images. As an equivalent model-focused NLP detection approach, this work proposes a simple sentence-embedding "residue" based detector to identify adversarial examples. On many tasks, it out-performs ported image domain detectors and recent state of the art NLP specific detectors.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- Oceania > Australia > Victoria > Melbourne (0.04)
- North America > United States > Oregon (0.04)
- (3 more...)
- Information Technology > Security & Privacy (1.00)
- Government > Military (1.00)
Clustering and Semi-Supervised Classification for Clickstream Data via Mixture Models
Gallaugher, Michael P. B., McNicholas, Paul D.
Finite mixture models have been used for unsupervised learning for some time, and their use within the semi-supervised paradigm is becoming more commonplace. Clickstream data is one of the various emerging data types that demands particular attention because there is a notable paucity of statistical learning approaches currently available. A mixture of first-order continuous time Markov models is introduced for unsupervised and semi-supervised learning of clickstream data. This approach assumes continuous time, which distinguishes it from existing mixture model-based approaches; practically, this allows account to be taken of the amount of time each user spends on each webpage. The approach is evaluated, and compared to the discrete time approach, using simulated and real data.
- North America > United States > Texas (0.04)
- North America > United States > Florida > Palm Beach County > Boca Raton (0.04)
- North America > United States > California > Alameda County > Hayward (0.04)
- (2 more...)